GWU Data Science Datathon

DC Crash Data Analysis

Tanaya Kavathekar

Abstract

Road accidents are a critical problem: every year more than 1.2 million people die in crashes across the globe. There is a pressing need to use the available data to understand the underlying causes. Road safety issues are complex, and policies differ significantly within and across countries. This analysis studies data from the Metropolitan Police Department's (MPD) crash data management system (COBALT) to find relationships between fatality and the independent features. The crash data covers the District of Columbia (DC).

Each year, more than 1.2 million people die across the globe in road crashes, so there is a pressing need to understand the underlying causes of the problem. Road safety issues are complex and multi-sectoral, involving the public, stakeholders and policy makers. Significant differences exist both across and within countries, so policies and interventions need to be adapted to the local environment. Effective interventions require a multi-disciplinary approach that includes enforcement, engineering, psychology and education. Because resources are limited, road safety interventions must address not only the sustainability of their outcomes but also the cost-effectiveness of implementing and maintaining them. More importantly, interventions must be evidence-based and evaluated over time before they are translated into policy. Hence, research cannot be done in silos if it is to address the complexity of road safety issues. For sustainability, the development and implementation of road safety interventions need to be guided and governed by policy.

Pipeline

Data Preprocessing

In [16]:
data.describe()
Out[16]:
                 id      crime id     person id            age
count  5.963810e+05  5.963810e+05  5.963810e+05  426744.000000
mean   4.384924e+08  2.672116e+07  8.506922e+07      38.668302
std    1.721813e+05  1.238390e+06  8.613766e+06      20.897059
min    4.370014e+08  2.341134e+07  1.045383e+07   -7990.000000
25%    4.383433e+08  2.532167e+07  8.474899e+07      27.000000
50%    4.384924e+08  2.680585e+07  8.497752e+07      37.000000
75%    4.386415e+08  2.769386e+07  8.712287e+07      51.000000
max    4.387906e+08  2.872803e+07  9.077153e+07     237.000000

Outlier treatment

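The age variable has many outliers: the summary above shows a minimum of -7990 and a maximum of 237. Based on the box plot, values below 0 and above 85 are considered outliers and replaced by NaN. The cell below applies the lower bound, setting ages under 1 to NaN.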

In [17]:
data['age'] = np.where(data['age']<1, np.nan, data['age'])
fig = px.histogram(data, x="age", marginal="box")
fig.show()
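As a minimal sketch of the outlier rule described above, the snippet below masks both tails in one step. It runs on a small made-up frame rather than the crash data, and the bounds `LOWER` and `UPPER` are assumptions taken from the text (ages under 1 as in the cell above, ages over 85 as in the prose), not a confirmed part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crash data; the values are illustrative only.
toy = pd.DataFrame({"age": [-7990, 0, 27, 37, 51, 85, 120, 237, np.nan]})

# Assumed bounds: ages under 1 and over 85 are treated as outliers.
LOWER, UPPER = 1, 85

# Series.where keeps values where the condition is True and sets the rest to NaN.
# between() is inclusive on both ends and returns False for NaN inputs.
toy["age_clean"] = toy["age"].where(toy["age"].between(LOWER, UPPER))

print(toy)
```

Writing the result to a separate column keeps the raw ages available for comparison; overwriting `age` in place would match the notebook cell more closely.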

Missing values detection and treatment

Only the age column has missing data (about 30% of rows), which is replaced by 0
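The detection and replacement steps described above might look like the following sketch. It uses a small made-up frame, and the `age_missing` indicator column is an optional addition assumed here for illustration (it preserves which rows were imputed); it is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crash data; the values are illustrative only.
toy = pd.DataFrame({
    "age": [27.0, np.nan, 51.0, np.nan, 37.0],
    "fatal": [0, 0, 1, 0, 0],
})

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean().mul(100).round(2)
print(missing_pct)

# Hypothetical indicator so imputed rows stay distinguishable from real values.
toy["age_missing"] = toy["age"].isna().astype(int)

# Replace missing ages with 0, as described in the text.
toy["age"] = toy["age"].fillna(0)
print(toy)
```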

Column Creation

  • Map the boolean fields fatal, impaired, speeding and ticket issued to 1s and 0s
  • Fatality rate: at the accident level, the number of people fatally injured divided by the total number of people involved in the accident
  • Severity level: combine the minor, major and fatal injury columns into a single column of indicator labels
  • Map US states to the 4 regions designated by the US Census Bureau
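The derived columns listed above can be sketched as follows. This is an illustration on a made-up frame, not the notebook's actual code: the "Y"/"N" encoding of the flags, the `state` column name, the label strings and the partial `REGION` lookup are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame: one row per person, grouped into crashes by "crime id".
toy = pd.DataFrame({
    "crime id": [1, 1, 1, 2, 2],
    "fatal": ["Y", "N", "N", "N", "N"],         # assumed "Y"/"N" encoding
    "major injury": ["N", "Y", "N", "N", "N"],
    "minor injury": ["N", "N", "N", "Y", "N"],
    "state": ["DC", "MD", "VA", "NY", "CA"],    # hypothetical column name
})

# 1. Map the boolean flag columns to 1s and 0s.
flag_cols = ["fatal", "major injury", "minor injury"]
toy[flag_cols] = (toy[flag_cols] == "Y").astype(int)

# 2. Fatality rate per crash: fatalities / people involved, broadcast to each row.
toy["fatality rate"] = toy.groupby("crime id")["fatal"].transform("mean")

# 3. Severity label; the first matching condition wins, so fatal outranks major.
toy["severity"] = np.select(
    [toy["fatal"] == 1, toy["major injury"] == 1, toy["minor injury"] == 1],
    ["fatal", "major", "minor"],
    default="none",
)

# 4. Partial state-to-region lookup; a full table would cover every state.
REGION = {"DC": "South", "MD": "South", "VA": "South",
          "NY": "Northeast", "CA": "West", "IL": "Midwest"}
toy["region"] = toy["state"].map(REGION)

print(toy)
```

States absent from the lookup come back as NaN from `map`, which makes gaps in the mapping easy to spot.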

EDA

In this analysis, fatal is the target variable. As shown below, the dataset is heavily imbalanced, which is addressed later in the analysis. Only about 0.07% of records (~417) are fatal

In [24]:
print("Percentage of non fatal class {} and fatal class {} ".format(
    round((data.groupby(['fatal']).agg({'fatal':'count'})['fatal'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['fatal']).agg({'fatal':'count'})['fatal'][1]/ data.shape[0])*100, 2)))

print("Percentage of non major class {} and major class {} ".format(
    round((data.groupby(['major injury']).agg({'major injury':'count'})['major injury'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['major injury']).agg({'major injury':'count'})['major injury'][1]/ data.shape[0])*100, 2)))

print("Percentage of non minor class {} and minor class {} ".format(
    round((data.groupby(['minor injury']).agg({'minor injury':'count'})['minor injury'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['minor injury']).agg({'minor injury':'count'})['minor injury'][1]/ data.shape[0])*100, 2)))
Percentage of non fatal class 99.93 and fatal class 0.07 
Percentage of non major class 96.42 and major class 3.58 
Percentage of non minor class 88.93 and minor class 11.07 
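The same class shares can be computed more compactly with `value_counts(normalize=True)`. The sketch below uses a small made-up frame, so its percentages are illustrative and do not reproduce the figures above; it also assumes the flag columns are already coded as 0 and 1.

```python
import pandas as pd

# Toy frame standing in for the crash data; flags are assumed to be 0/1.
toy = pd.DataFrame({
    "fatal": [0] * 9 + [1],
    "major injury": [0] * 8 + [1] * 2,
    "minor injury": [0] * 7 + [1] * 3,
})

shares = {}
for col in ["fatal", "major injury", "minor injury"]:
    # normalize=True returns fractions of the non-missing rows; scale to percent.
    shares[col] = toy[col].value_counts(normalize=True).mul(100).round(2)
    print("{}: negative class {}%, positive class {}%".format(
        col, shares[col].loc[0], shares[col].loc[1]))
```

One difference from the cell above: `value_counts` divides by the number of non-missing values, while dividing by `data.shape[0]` uses all rows, so the two only agree when the column has no missing values.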

How many people involved in the accidents suffered injury?

In [25]:
crime = pd.DataFrame(data[['crime id', 'severity']].value_counts())
crime = crime.reset_index()
crime.columns = ['crime id', 'severity', 'count of people']
fig = px.histogram(crime, x="count of people", color='severity', marginal="box")
fig.show()